Stationary Points of a Shallow Neural Network with Quadratic Activations and the Global Optimality of the Gradient Descent Algorithm
Abstract
We consider the problem of training a shallow neural network with quadratic activation functions and the generalization power of such trained networks. Assuming that the samples are generated by a full rank matrix of the hidden network node weights, we obtain the following results. We establish that all full-rank approximately stationary solutions of the risk minimization problem are also approximate global optimums of the risk (in-sample and population). As a consequence, we establish that, when trained on polynomially many samples, the gradient descent algorithm converges to the global optimum of the risk minimization problem regardless of the width of the network when it is initialized at some value , which we compute. Furthermore, the network produced by the gradient descent has a near zero generalization error. Next, we establish that initializing the gradient descent algorithm below is easily achieved when the weights of the ground truth matrix are randomly generated and the matrix is sufficiently overparameterized. Finally, we identify a simple necessary and sufficient geometric condition on the size of the training set under which any global minimizer of the empirical risk has necessarily zero generalization error.
Funding: The research of E. C. Kizildag is supported by Columbia University, with the Distinguished Postdoctoral Fellowship in Statistics. Support from the National Science Foundation [Grant DMS-2015517] is gratefully acknowledged.
1. Introduction
Neural network architectures are demonstrated to be extremely powerful in practical tasks, such as natural language processing (Collobert and Weston [20]), image recognition (He et al. [46]), image classification (Krizhevsky et al. [53]), speech recognition (Mohamed et al. [63]), and game playing (Silver et al. [80]), and is becoming popular in other areas, such as applied mathematics (Chen et al. [17], Weinan et al. [94]), clinical diagnosis (De Fauw et al. [21]), and so on. Despite this empirical success, a rigorous mathematical understanding of these architectures is still an ongoing quest.
Despite the fact that it is NP-hard to train such architectures in the worst case setting, it has been observed empirically that the gradient descent (GD), albeit a simple, first order, local procedure, is rather successful in training such networks. This is somewhat surprising because of the highly nonconvex nature of the associated objective function. Our main motivation in this paper is to provide further insights into the optimization landscape and generalization abilities of these networks.
1.1. Model, Contributions, and Comparison with Prior Work
In this section, we introduce the model considered in this paper, describe our contributions, and discuss the relevant literature.
1.1.1. Model.
In this paper, we consider a shallow neural network architecture with one hidden layer of width m. Namely, the network consists of m neurons. We study it under the realizable model assumption, that is, the labels are generated by a teacher network with ground truth weight matrix whose row carries the weights of the neuron and . We assume that the input data consists of independent and identically distributed (i.i.d.) centered sub-Gaussian coordinates. It is worth noting that such shallow architectures with planted weights and Gaussian input data are explored extensively in the literature (see, e.g., Brutzkus and Globerson [11], Du et al. [28], Li and Yuan [56], Soltanolkotabi [82], Tian [89], Zhong et al. [99]).
Our focus is, in particular, on networks with quadratic activation, studied also by Soltanolkotabi et al. [83] and Du and Lee [24], among others. This object, an instance of what is known as a polynomial network (Livni et al. [59]), computes for every input data the function
We note that, albeit a stylized activation function, blocks of quadratic activations can be stacked together to approximate deeper networks with sigmoid activations as shown by Livni et al. [59], and furthermore, this activation serves as a second order approximation of general nonlinear activations as noted by Venturi et al. [90]. Thus, we study the quadratic networks as an attempt to gain further insights on more complex networks.
Let be an i.i.d. collection of input data, and let be the corresponding label generated per (1). The goal of the learner is as follows: given the training data , find a weight matrix that explains the input–output relationship on the training data set in the best possible way, often by solving the so-called empirical risk minimization (ERM) optimization problem
1.1.2. Contributions.
We first study the landscape of risk functions and quantify an energy barrier separating rank-deficient matrices from the full-rank planted weights. Specifically, if is full-rank, namely, rank d (recall ), then the risk function for any rank-deficient W is bounded away from zero by an explicit constant—independent of d—controlled by the smallest singular value of as well as the second and fourth moments of the data. See Theorem 1 for the population and Theorem 2 for the empirical versions of this result. (Theorem 2 holds with high probability (w.h.p.) with respect to the observed sample.)
Next, we study the full-rank stationary points of the risk functions and the performance of the gradient descent algorithm. We first establish that, when is full rank, any full-rank stationary point W of the risk functions is necessarily a global minimum and any such W is of form , where is orthonormal. See Theorem 3 for the population and Theorem 4 for the empirical versions. Namely, W is a global optimum up to a rotation. We then establish that all approximate stationary points (appropriately defined) W of below the aforementioned energy barrier are nearly global optimum. Furthermore, we establish that, if the number N of samples is , then the weights W of any full-rank approximate stationary point are uniformly close to . As a corollary, gradient descent with initialization below the energy barrier recovers in time a solution W for which the weights are ϵ-close to the planted weights. Consequently, the generalization error for this solution W is at most ϵ. The bound on is derived by controlling the condition number of a certain matrix whose i.i.d. rows consists of tensorized data , using a recently developed machinery in Emschwiller et al. [33] studying the spectrum of expected covariance matrices of tensorized data. See Theorem 5 for the population and Theorem 6 for the empirical version.
Subsequently, we study the question of whether one can find the initialization of the gradient descent algorithm below the aforementioned energy barrier. We answer affirmatively this question in the context of randomly generated and establish in Theorem 8 that, as long as the network is sufficiently overparameterized, specifically , for some sufficiently large constant C, it is possible to initialize W0 such that w.h.p. the risk associated to W0 is below the required threshold. This is achieved by using tools from random matrix theory, specifically a semicircle law for Wishart matrices that shows the spectrum of is tightly concentrated (Bai and Yin [3]). See Theorem 7 for the population and Theorem 8 for the empirical version. It is worth noting that neural networks with random weights is an active area of research by itself because of the relationship with random feature methods. For example, Rahimi and Recht [73] show that shallow architectures trained by choosing the internal weights randomly and optimizing only over the output weights return a classifier with reasonable generalization performance at accelerated training speed. Random shallow networks are also shown to well-approximate dynamical systems (Gonon et al. [40]), have been successfully employed in the context of extreme learning machines (Huang et al. [48]), and are studied in the context of random matrix theory; see Pennington and Worah [69] and references therein.
Our next focus is on the sample complexity for generalization. Even though we study the landscape of the empirical risk, it is not by any means certain that any (potentially not full-rank) optimizer of also achieves zero generalization error. We give necessary and sufficient conditions on the samples so that any minimizer has indeed zero generalization error in our setting. We show that, if is the space of all d × d-dimensional real symmetric matrices, then any global minimum of the empirical risk is necessarily a global optimizer of the population risk and, thus, has zero generalization error. Note that this geometric condition is not retrospective in manner: it can be checked ahead of the optimization procedure by computing . Conversely, we show that, if the preceding span condition is not met, then there exists a global minimum W of the empirical risk function that induces a strictly positive generalization error. This is established in Theorem 9.
To complement our analysis, we then ask the following question: what is the critical number of the training samples under which the (random) data enjoys the aforementioned span condition? We prove this number to be under a very mild assumption that the coordinates of are jointly continuous. This is shown in Theorem 10. Finally in Theorem 11, we show that, when , not only does there exist W with zero empirical risk and strictly positive generalization error, we also bound this error from below by an amount very similar to the bound for rank-deficient matrices discussed in our earlier Theorem 2.
We end with a comment on overparameterization and generalization. A common paradigm in statistical learning theory is that overparameterized models, that is, models with more parameters than necessary, although capable of interpolating the training data, tend to generalize poorly because of overfitting to the proposed model. Yet it is observed empirically that neural networks tend not to suffer from this complication (Zhang et al. [96]): despite being overparameterized, they seem to have a good generalization performance provided the interpolation barrier is exceeded. In Theorem 9(a), we establish the following result, which sheds some light on this phenomenon for the case of shallow neural networks with quadratic activations: suppose that the data enjoys the aforementioned geometric condition. Then, any interpolator achieves zero generalization error even when the interpolator is a neural network with a potentially larger number of internal nodes compared with the one that generated the data, namely, by using a weight matrix , where . In other words, the model does not overfit when a much larger width of the interpolator is chosen at the learning state.
1.1.3. Comparison with Soltanolkotabi et al. [83] and Du and Lee [24].
We now make a comparison with two very related prior works, also studying quadratic activations. We start with the work by Soltanolkotabi et al. [83]. In Soltanolkotabi et al. [83, theorem 2.2], the authors study the empirical risk landscape of a slightly more general version of our model: , assuming like us and assuming all nonzero entries of have the same sign. Thus, our model is the special case in which all entries of are equal to unity. The authors establish that, as long as for some small fixed constant c, every local minima of the empirical risk function is also a global minima (namely, there exists no spurious local minima), and furthermore, every saddle point has a direction of negative curvature. As a result, they show that gradient descent with an arbitrary initialization converges to a globally optimum solution of the ERM problem (2). In particular, their result does not require the initialization point to be below some risk value (the energy barrier) as in our case. Nevertheless, our results show that one needs not to worry about saddle points below the energy barrier as none exists per our Theorem 2. Importantly, though, the regime for small c that Soltanolkotabi et al. [83, theorem 2.2] applies is below the provable sample complexity value when the data are drawn from a continuous distribution as per our Theorem 10. In particular, as we establish when , the ERM problem (2) admits global optimum solutions with zero empirical risk value but with the generalization error bounded away from zero. Thus, the regime does not correspond to the regime in which solving the ERM has a guaranteed control on the generalization error. The same theorem in Soltanolkotabi et al. [83] also studies the approximate stationary points and shows that, for any such point W, the associated empirical risk, , is also small. Our Theorem 6, though, takes a step further and shows that not only is the empirical risk small, but the recovered W is close to planted weights , and therefore, it has a small generalization error by explicitly bounding the generalization error from above.
It is also worth noting that, although not our focus in the present paper, Soltanolkotabi et al. [83, theorem 2.1] also studies the landscape of the empirical risk when a quadratic network model is used for interpolating arbitrary input/label pairs , that is, without making an assumption that the labels are generated according to a network with planted weights. They establish similar landscape results, namely, the absence of spurious local minima and the fact that every saddle point has a direction of negative curvature as long as the output weight has at least d positive and at least d negative entries (consequently, the width m has to be at least 2d). Even though this result does not assume any rank condition on W, the assumption on the minimum number of positive and minimum number of negative entries, such as the preceding one, is somewhat unnatural.
Yet another closely related work studying quadratic activations is the paper by Du and Lee [24], which focuses on the shallow architectures with all unity output weights as we do. This paper establishes that, for any smooth and convex loss , the landscape of the regularized loss function still admits aforementioned favorable geometric characteristics. Furthermore, as the learned weights are of bounded Frobenius norm thanks to the norm penalty imposed on the objective, they retain good generalization via Rademacher complexity considerations. Even though this work addresses the training and generalization error when the norm of W is controlled during training, it does not carry out approximate stationarity analysis as Soltanolkotabi et al. [83] and we do and does not study their associated loss/generalization as in our case. Even though they show that optimal solutions to the optimization problem incorporating bounded norms generalize well, it remains unclear from their analysis whether the approximate stationary points of this objective also have a well-controlled norm.
It is worth mentioning that the two main directions that we undertake in this paper were not explored in either of these two prior works. These include the direction pertaining to the initialization (Theorems 7 and 8) and the direction pertaining to the sample complexity (Theorems 9–11). The latter direction relates to an interesting interpolation/overparameterization property that we have discussed before. We return later to this direction in Section 2.3 after we present Theorem 9.
1.1.4. Connection to Matrix Sensing.
The problem of learning shallow quadratic networks with planted weights is closely related to the matrix sensing problem. This problem has many applications, spanning from video and image processing to control and sensor networks (see Deng et al. [23], Qin et al. [72], Zhong et al. [97] and references therein); we now elaborate on the connection between our setting and the matrix sensing problem. Given an unknown matrix , the goal of the matrix sensing problem is to recover from a (small) number of linear measurements of form , where , and . Recalling (1), our data model corresponds to
Namely, the problem that we study in the present paper is an instance of the matrix sensing problem; the goal is to recover from rank-1 measurements of form , where have centered i.i.d. sub-Gaussian coordinates. A related prior work by Zhong et al. [97] studies the problem of sensing a low-rank matrix from rank-1 measurements of form , where are i.i.d. and they consist of centered i.i.d. sub-Gaussian coordinates (see also Deng et al. [23] for an improvement of Zhong et al. [97] in terms of the time and sample complexity). Notice that Zhong et al. [97] studies a slightly different measurement matrix of form (as opposed to ). More importantly Zhong et al. [97] are concerned with recovering a low-rank , whereas the fact that is full-rank is crucial for our techniques; see subsequent details. A very recent work by Qin et al. [72] relaxes the low-rank assumption and studies measurement matrices of form for that are nearly orthogonal and give recovery guarantees for the stochastic gradient descent. Note that, for as before, standard concentration arguments show that the collection Xi, , consists (w.h.p.) of nearly orthogonal vectors. Therefore, the setting studied in Qin et al. [72] is very similar to ours, and their sample complexity guarantee regarding the gradient descent appears to have a better dependence on dimension than ours. On the other hand Qin et al. [72] focus only on the problem of finding an such that
1.1.5. Further Relevant Prior Work.
As noted in the introduction, neural networks achieved remarkable empirical success, which fueled research starting from the expressive ability of these networks, going as early as Barron [5]. More recent works along this front focus on deeper and sparser models; see, for example, Mhaskar et al. [62], Telgarsky [88], Eldan and Shamir [31], Schmidt-Hieber [78], Poggio et al. [70], and Bölcskei et al. [10]. In particular, the expressive power of such network architectures is relatively well understood. Another issue pertaining to such architectures is their computational tractability: Blum and Rivest [9] establish that it is NP-complete to train a very simple three-node network whose nodes compute a linear thresholding function. Despite this worst case result, it is observed empirically that local search algorithms, such as GD, are rather successful in training. Even though several authors, including Sedghi and Anandkumar [79], Janzamin et al. [49], and Goel et al. [38], devise provable training algorithms for such networks, these algorithms unfortunately are based on methods other than the gradient descent, thus not shedding any light on its apparent empirical success.
On a parallel front, many papers study the behavior of the GD by analyzing the trajectory of it or its stochastic variant (SGD) under certain stylistic assumptions on the data as well as the network. These assumptions include Gaussian inputs, shallow networks (with or without the convolutional structure), and the existence of planted weights (the so-called teacher network) generating the labels. Some partial and certainly very incomplete references to this end include Tian [89], Brutzkus and Globerson [11], Brutzkus et al. [12], Zhong et al. [98], Soltanolkotabi [82], Li and Yuan [56], and Du et al. [28]. Later work relaxes the distributional assumptions. For instance, Du et al. [25] study the problem of learning a convolutional unit with ReLU with no specific distributional assumption on input and establish the convergence of SGD with rate depending on the smoothness of the input distribution and the closeness of the patches. Several other works along this line, in particular under the presence of overparameterization, are the works by Du et al. [26, 27].
Yet another line of research on the optimization front, rather than analyzing the trajectory of the GD, focuses on the mean-field analysis. Empirical distribution of the parameters of network with infinitely many internal nodes can be described as a Wasserstein gradient flow, and thus, some tools from the theory of optimal transport can be used; see, for example, Wei et al. [93], Rotskoff and Vanden-Eijnden [75], Chizat and Bach [18], Song et al. [84], and Sirignano and Spiliopoulos [81]. Albeit explaining the story to some extent for infinitely wide networks, it remains unclear whether these techniques provide results for a more realistic network model with finitely many internal nodes.
As noted earlier, the optimization landscape of such networks is usually highly nonconvex. More recent research on such nonconvex objectives shows that, if the landscape has certain favorable geometric properties, such as the absence of spurious local minima and the existence of direction with negative curvature for every saddle point, local methods can escape the saddle points and converge to the global minima. Examples of this line of research on loss functions include Ge et al. [37], Levy [55], Lee et al. [54], Jin et al. [50], and Du et al. [29]. Motivated by this front of research, many papers analyze geometric properties of the optimization landscape, including Poston et al. [71], Haeffele et al. [43], Choromanska et al. [19], Haeffele and Vidal [42], Kawaguchi [51], Hardt and Ma [44], Soudry and Carmon [85], Freeman and Bruna [34], Zhou and Feng [100], Nguyen and Hein [67, 68], Ge et al. [36], Safran and Shamir [77], Soudry and Hoffer [86], Zhou and Liang [101], Venturi et al. [90], Du et al. [24], and Soltanolkotabi et al. [83].
We now touch upon yet another very important focus, that is, the generalization ability of such networks: how well a solution found, for example, by GD, predicts unseen data? A common paradigm in statistical learning theory that was mentioned previously is that overparameterized models tend to generalize poorly. Yet neural networks tend to not suffer from this complication (Zhang et al. [96]). Because the Vapnik–Chervonenkis dimension of these networks grows (at least) linear in the number of parameters (Bartlett et al. [7], Harvey et al. [45]), standard Vapnik–Chervonenkis theory does not help explaining the good generalization ability under presence of overparameterization. This is studied, among others, through the lens of the norm of weight matrices (Bartlett et al. [6], Dziugaite and Roy [30], Golowich et al. [39], Liang et al. [58], Neyshabur et al. [65], Wu et al. [95]), PAC-Bayes theory (Neyshabur et al. [64, 66]), and compression-based bounds (Arora et al. [1]). A main drawback is that these papers require some sort of constraints on the weights and are mostly a posteriori: whether a good generalization takes place can be determined only when the training process is finished. A recent work by Arora et al. [2] provides an a priori guarantee for the solution found by the GD. Our result regarding the generalization guarantee described in Theorem 11 also provides simple a priori guarantee on the generalization.
1.1.6. A Follow-up Work.
After our paper appeared on arXiv, a follow-up work was done by Mannelli et al. [60]. In this paper, the authors consider the same architecture (namely, a shallow network with quadratic activations) under the so-called teacher/student setting and study the landscape of the empirical risk as well as the performance of the gradient flow, the continuous-time analogue of the gradient descent. Importantly, they consider also the regime in which the number of the hidden units of the teacher network is less than the dimension d (whereas our focus is on ). In particular, when , the width m of the student network is at least d, and the data consists of i.i.d. standard normal entries; they prove the following. In the limit as , if , then with positive probability the only minimizer of the empirical risk is the matrix of teacher weights itself, whereas for , the empirical risk admits spurious minima with probability tending to one. (Namely, the geometry of the empirical risk undergoes a phase transition as crosses ). Moreover, they also prove that, for , the gradient flow converges to a global minima of the empirical risk and to the global minimum of the population risk (which is ) and characterize the rate of convergence for the latter case. (It is worth noting that running gradient flow on the population risk can be perceived as running it on the empirical risk in the limit of the large number n of samples.)
1.1.7. Paper Organization.
In Section 2.1, we present our main results on the landscape of the risk functions, including our energy barrier result for rank-deficient matrices, our result about the absence of full-rank stationary points of the risk function except the globally optimum points, and our result on the convergence of gradient descent. In Section 2.2, we present our results regarding randomly generated weight matrices and sufficient conditions for good initializations. In Section 2.3, we study the critical number of training samples guaranteeing good generalization property. We collect useful auxiliary lemmas in Section 3 and provide the proofs of all of our results in Section 4.
1.1.8. Notation.
The set of reals, positive reals, and the set are denoted, respectively, by , and . For any matrix A, its smallest and largest singular values, spectrum, trace, Frobenius, and spectral norm are denoted, respectively, by , and . We denote the n × n identity matrix by In. Planted weights are denoted with an asterisk, for example, . denotes . Given any denotes its Euclidean norm . Given two vectors , their Euclidean inner product is denoted by . Given a collection of objects of the same kind (e.g., vectors or matrices), is the set, . We say a random variable X is centered if . , and are standard (asymptotic) order notations for comparing the growth of two sequences. , and denote, respectively, the empirical risk, its gradient, and the population risk and its gradient.
2. Main Results
Our main results are now in order.
2.1. Optimization Landscape
2.1.1. Existence of an Energy Barrier.
Our first result shows the presence of an energy barrier in the landscape of the population risk below which any rank-deficient ceases to exist.
Suppose that has i.i.d. centered coordinates with variance μ2, (finite) fourth moment μ4, , and let be defined as (3).
(Lower bound) It holds that
(Tightness) There exists a matrix such that and
The proof of Theorem 1 is deferred to Section 4.2. Two remarks are in order.
First, the hypothesis of Theorem 1 holds under mild distributional assumptions on the coordinates of data: a finite fourth moment and zero mean suffices.
Second, part (b) of Theorem 1 implies that our lower bound on the energy value is tight up to a multiplicative constant determined by the moments of the data. That is, there exists a W with such that , where the asymptotic hides the constants μ2 and μ4.
Our next result is an analogue of Theorem 1 for the empirical risk , and it establishes the presence of a similar energy barrier in the landscape of the empirical risk below which any rank-deficient ceases to exist with high probability.
Let K > 0 be an arbitrary constant and be a collection of i.i.d. random vectors each having centered i.i.d. sub-Gaussian coordinates. That is, for some C > 0, for every . Suppose, furthermore, that, for every M > 0, the distribution of X(j), conditional on , is centered: . Let be the corresponding label generated by a planted teacher network per (1), where and . Then, for some absolute constants with probability at least
Here,
Furthermore, if the dimension d of data is constant (), then with probability ,
The proof of Theorem 2 is provided in Section 4.9. Several remarks are now in order.
Assuming d is large, Theorem 2 shows that, with high probability, is bounded away from zero by an explicit constant for any W that is rank-deficient provided , where the O(1) term depends on K. Furthermore, provided N is a sufficiently large polynomial-in-d quantity, the probability estimate is of form , which is exponential in the dimension.
Note that one indeed needs a finite d correction for the case when the data are low-dimensional, : for , the term makes the probability estimate vacuous. The constant appearing in this case is precisely the same constant appearing in Theorem 1, and in particular, no conditioning is required. Furthermore, even though we establish the probability estimate to be for simplicity, it can be improved: it appears, from our analysis (which uses Chebyshev’s inequality), that for any , one can show the probability estimate to be . Furthermore, using more elaborate tools (such as concentration for heavy-tailed variables, in particular, for i.i.d. averages of fourth moments of sub-Gaussian variables), this estimate can potentially be improved to for suitable constants . Finally, our analysis yields also that the probability estimate still remains valid even when Xi has centered i.i.d coordinates with a finite eight moment. That is, the sub-Gaussianity assumption can be relaxed. Indeed, the expression is an i.i.d. sum of the form for a suitable matrix M, and one needs the finiteness of to apply Chebyshev’s inequality: this quantity is finite provided and . We do not pursue these extensions for keeping the presentation clear.
We now discuss the assumption that the coordinates of Xi are conditionally centered. This assumption is not inherent to the problem; it is adopted purely for technical reasons. Toward Theorem 2, one needs to establish the existence of an analogous energy barrier for a single rank-deficient and should in particular study an event of form
We next comment on the dependence on through the constant K. The (value of) energy barrier is independent of this gap, whereas the probability guarantee does depend on it. This is an artifact of our proof technique. Our proof is based on an ϵ-net argument for a set of rank-deficient matrices with bounded together with a union bound. The cardinality bound we employ on this net per Lemma 3 depends on dK; consequently, this constant appears in the probability guarantee. Once again, it might be possible to avoid this by using a different argument as outlined earlier.
An inspection of the proofs of Theorems 1 and 2 yield that the landscapes of the corresponding risks still admit an energy barrier even if we consider the same network architecture with planted weight matrix and quadratic activation function having lower order terms, that is, the activation with . In this case, in addition to and the corresponding moments of the data, the coefficient α also appears in the barrier expression. In particular, Theorem 1 still remains valid with replaced with , and Theorem 2 still remains valid with replaced with .
2.1.2. Global Optimality of Full-Rank Stationary Points.
We now establish that, if W is a full-rank stationary point of the population risk, , then W is necessarily a global minimum.
Suppose with . Suppose has centered i.i.d. coordinates with , and . Let be a stationary point of the population risk with full-rank, that is, , and . Then, for some orthogonal matrix Q and that .
The proof of Theorem 3 is deferred to Section 4.6.
Our next result is an analogue of Theorem 3 for the empirical risk, , and shows that, if and W is any full-rank stationary point of the empirical risk, then W is necessarily a global minimum.
Let , with , and suppose W is a full-rank stationary point of the empirical risk: , and . Then, . Furthermore, if Xi are i.i.d. random vectors having a jointly continuous distribution and , then with probability one, for some orthogonal matrix .
The proof of Theorem 4 is given in Section 4.10.
Note that an implication of Theorems 3 and 4 is that the corresponding losses admit no full-rank saddle points. Namely, the landscape of the corresponding losses has fairly benign properties below the aforementioned energy barrier. We soon show how this implies the convergence of gradient descent in the next section.
2.1.3. Convergence of Gradient Descent.
We now combine Theorems 1 and 3 to obtain the following conclusion on the performance of the gradient descent algorithm for the population risk. Suppose that the algorithm is initialized at a point W0 having a small population risk , in particular, lower than the smallest risk value achieved by the rank-deficient matrices. Then, with a properly chosen step size, the algorithm converges to a global optimum: it generates a trajectory of weights such that .
Let be a matrix of weights with the property that
Define
The proof of Theorem 5 is provided in Section 4.7.
Our next focus is on the performance of the gradient descent algorithm for the empirical risk. By combining Theorems 2 and 4, we obtain the following conclusions. Suppose that the gradient descent algorithm is initialized at a point with a sufficiently small empirical risk, in particular, lower than the smallest risk value achieved by rank-deficient matrices (i.e., the energy barrier), and fix an . Then, with a properly chosen step size, it finds an approximate stationary point W (that is, a with a small ) in time for which the weights WTW are uniformly ϵ-close to the planted weights , and consequently, the generalization error is at most ϵ. Furthermore, the algorithm converges to a global optimum of the empirical risk minimization problem , which is zero, thus recovering planted weights because of the absence of spurious stationary points within the set of full-rank matrices.
Let be arbitrary. Suppose that satisfies the assumptions in Theorem 2; is a matrix of weights with the property
For any W with , . Moreover, .
Running gradient descent algorithm starting from W0 with a step size of generates a full-rank with in time . Furthermore, for this W, .
For W in (b), it holds that , and the generalization error is at most ϵ provided .
Furthermore, suppose the data dimension d is constant, , and is a matrix of weights with the property for the same constant appearing in Theorem 2. Then, for ζ chosen per (4), with probability at least ,
For any W with , it is the case that . Moreover, .
A running gradient descent algorithm starting from W0 with a step size of generates a full-rank W with and .
For any W in (b), .
The proof of Theorem 6 is provided in Section 4.11. Several remarks are now in order.
As a simple corollary to (b), we, thus, obtain that, for d large enough, the gradient descent algorithm with initialization and a step size of generates a trajectory of weights such that . This yields an analogue of Theorem 5 for the empirical risk, .
Note that, when N is a sufficiently large polynomial-in-d function, the probability estimate for the first part is essentially . Furthermore, we note that the exponent 1/4 in the probability estimate and the sample bound are required only for part (c) and can potentially be improved. In particular, the exponent can be improved to one for parts (a), (b), and (d).
We now expand on the sample complexity per part , which appears rather poor. In light of Soltanolkotabi et al. [83], a much smaller sample complexity of suffices to obtain an approximate stationary point W. Our work, on the other hand, takes a step further: we control the deviation of WTW from and, subsequently, bound the generalization error. As we mention earlier, we are unaware of similar guarantees in the literature; the prior work, in fact, seems to focus only on minimizing the training error. This extension is based on a certain technical result established in Emschwiller et al. [33] (see the following for details), and the sample bound is likely to be an artifact of this approach. Having said this, it might though be possible to improve the dependence on dimension d. As a potential avenue, one may consider leveraging techniques from the phase retrieval literature, such as PhaseLift (Candès and Li [13], Candès et al. [15], Demanet and Hand [22], Eldar and Mendelson [32]), or from the matrix sensing literature mentioned earlier. It appears that some of these algorithms solve a certain convex relation, and the fact that the underlying matrix to be recovered is low rank is crucial for this convex program. In our case, and, therefore, are full rank, so it is not clear whether these algorithms would apply as is. In fact, as mentioned earlier, closely related works in the matrix sensing literature address a (variant of) an empirical risk minimization problem though they do not quantify how close the obtained solution and the ground truth are. The connection between our model and the phase retrieval/matrix sensing literature is very intriguing; it is plausible that the algorithms in the latter domain might apply to our case. This direction merits further investigation and is left for future work.
Note again that, analogous to Theorem 2, one needs a correction for the finite d case because the term makes the probability estimate vacuous for . Furthermore, the remarks following Theorem 2 still remain valid. In particular, the choice of is for simplicity, and the estimate can be improved almost immediately to for any and to for some using more delicate concentration tools.
We now provide an important remark pertaining to (c): as we show in the proof, provided N grows at least polynomially in d, with probability over Xi, it holds that, for any W having a small risk, , WTW is close to : is small. Consequently, is small. This is one of the additional technical results of our paper and is achieved by controlling the condition number of a certain matrix whose i.i.d. rows consist of the tensorized data . The proof uses a recent work analyzing the spectrum of expected covariance matrices of tensorized data (Emschwiller et al. [33, theorem 5.1]). It is worth highlighting that the independence of the coordinates of Xi is crucial for establishing Theorem 6. Our results leverage techniques in Emschwiller et al. [33] and require, in particular, that the coordinates of Xi are samples from an admissible distribution in the sense of Emschwiller et al. [33, definition 2.4, theorem 5.1]; they would no longer be valid if Xi had potentially dependent coordinates.
These results concern the performance of gradient descent assuming the initialization is proper, that is, it is below the aforementioned energy barrier. One can then naturally ask whether such an initialization is indeed possible in some generic context. In the next section, we address this question of proper initialization when the (planted) weights are generated randomly in order to complement Theorems 5 and 6. We establish that such a proper initialization is indeed possible by providing a deterministic initialization guarantee, which, with high probability, beats the aforementioned energy barrier.
Before we close this section, we remark on the full-rank assumption, . This assumption can be relaxed for Theorems 1 and 2, and one can establish a similar energy barrier for all W with . For instance, suppose and . Using an eigenvalue interlacing argument (see, e.g., Fulton [35, equation 11]), one can subsequently modify the proofs of Theorems 1 and 2 and obtain a similar energy barrier for all W whose rank is strictly below that of planted . On the other hand, the full-rank assumption appears necessary for the results regarding stationary points and global optimality (Theorems 3 and 4) and the subsequent convergence guarantees regarding the gradient descent (Theorems 5 and 6). In fact, a similar full-rank assumption to rule out spurious local minima is also employed in the very related work Soltanolkotabi et al. [83, theorem 2.2] discussed earlier. One of their results (Soltanolkotabi et al. [83, theorem 2.1]) seems to avoid this though at the expense of a rather unnatural assumption that the output layer contains at least d positive and d negative entries. Inspecting their proof as well as ours, a full-rank assumption or the assumption that the output layer has d positive and d negative entries seem crucial: to study the gradient of a stationary point and subsequently declare global optimality, it appears necessary to show that the kernel of a certain matrix arising in the gradient calculation is trivial. It would be interesting to see to which extent this assumption can be relaxed or whether it can be avoided altogether; we leave this as an interesting open direction for future work.
2.2. On Initialization: Randomly Generated Planted Weights
Our results in the previous section show that, provided the initialization of the gradient descent method occurs below the critical energy, the algorithm converges to the global minimum. This raises the question whether such an initialization can be found in a constructive way.
In this section, we show that the answer is yes in the setting of randomly generated weights of the ground truth matrix . Specifically, we provide a way to properly initialize such networks under the assumption that the (planted) weight matrix has arbitrary i.i.d. centered entries with unit variance and a finite fourth moment and the data has centered i.i.d. sub-Gaussian coordinates. (It is worth mentioning that, similar to before, the sub-Gaussianity assumption on the data are required only for the case of empirical risk, and the corresponding population risk result holds under a milder distributional assumption; see Theorem 7.) Our result is valid provided that the network is sufficiently overparameterized: for some large constant C. Note that this implies is a tall matrix sending into . The rationale behind this approach is as follows. The value of the risk is determined by the spectrum of and the moments of the data distribution. Under our randomness assumption, the so-called Wishart matrix is tightly concentrated around a multiple of the identity if m is sufficiently large. Hence, one can control the spectrum of Δ and, therefore, the loss functions ( and ) by properly choosing the initialization W.
We now state the main results of this section, starting with the population risk version.
Suppose that the data consists of i.i.d. centered coordinates with and . Recall from (3).
(Gaussian case) Suppose that the planted weight matrix has i.i.d. standard normal entries. Let the initial weight matrix be defined by for and otherwise; that is, with . Then, provided for a sufficiently large absolute constant C > 0,
with probability at least , where the probability is with respect to the draw of .(General case) Suppose the planted weight matrix has centered i.i.d. entries with unit variance and a finite fourth moment. Let the initial weight matrix be defined by for , and otherwise; that is, . Then, provided for a sufficiently large absolute constant C > 0,
with high probability as , where the probability is with respect to the draw of .
The proof of this theorem is provided in Section 4.8.
Note that, part (a) of Theorem 7 gives an explicit rate for probability in the case when the i.i.d. entries of the planted weight matrix are standard normal and is based on a nonasymptotic concentration result for the spectrum of such matrices. The extension in part (b) is based on a semicircle law obtained by Bai and Yin [3].
The corresponding result for the empirical risk is provided as follows.
Suppose that the planted weight matrix has centered i.i.d. entries with unit variance and a finite fourth moment; the (i.i.d.) data , has i.i.d. centered sub-Gaussian coordinates (namely, for some C > 0, for any t > 0, and ) and the satisfies for and for , that is, . Then, for some absolute constants with probability at least
The proof of Theorem 8 is provided in Section 4.12.
It is worth noting that, unlike the earlier Theorems 2 and 6, Theorems 7 and 8 do not have a separate statement for the case of finite d (): in order to ensure the concentration property for the Wishart ensemble takes place, one should consider the regime . That is, our initialization results do not hold for the regime in which d is constant.
We highlight that, for low-rank models, such as the problem of sensing a low-rank matrix described earlier as well as learning quadratic networks with a low-rank weight matrix , prior work proposes spectral initialization (Gunasekar et al. [41], Khodak et al. [52], Li et al. [57], Stöger and Soltanolkotabi [87], Vodrahalli et al. [92]) and analyzes the performance of gradient descent in recovering the planted weights. More concretely, for recovering an unknown rank-r matrix , these works propose initializations of the form UUT, where . For our results, in particular, the global optimality of certain stationary points (Theorems 3 and 4) as well as the convergence guarantees for the gradient descent (Theorems 5 and 6), the fact that is full rank is crucial. Even though such a low-rank initialization would not suffice for the goal of finding a W0 with loss below the energy barrier, it would be interesting to analyze the performance of such an initialization for a higher number of iterations. If successful, such an initialization would enable us to relax the randomness assumption on ; we leave this as an interesting future direction.
We now turn our attention to the number of training samples required to learn such models.
2.3. Critical Number of Training Samples
In this section, we explore the smallest number of training data for which a small empirical risk controls the generalization error.
2.3.1. A Necessary and Sufficient Geometrical Condition.
We start by identifying a necessary and sufficient geometrical condition on the data under which any minimizer of the empirical risk has zero generalization error. For our setting, any such minimizer necessarily interpolates the data.
Let be a set of data. Let (P) be the property that , where is the set of all d × d symmetric real matrices.
Suppose that (P) holds. Then, for any and satisfying , we have . In particular, if , then for some with .
Suppose that (P) does not hold. Then, for any with and any , there exists a such that, whereas . In particular, almost surely with respect to any jointly continuous distribution on .
The proof of Theorem 9 is provided in Section 4.13.
Note that the property (P), , is a purely geometrical necessary and sufficient condition; it can be checked ahead of the optimization (not retrospective). Under (P), any minimizer of empirical risk has perfect generalization. Conversely, when (P) does not hold, there are global optima of that (almost surely) have a strictly positive generalization error. Furthermore, Theorem 9 applies to any arbitrary data set Xi (not necessarily random). We highlight the extension that, when is positive though small, one can control via Theorem 5(c) and bound . For a more refined version of Theorem 9, see Theorem 11.
We next highlight the parameter . Note that, when (P) holds, any network with nodes interpolating the data necessarily has zero generalization error. This holds even in the overparameterized regime, . Namely, once the data are interpolated, overparameterization does not hurt the generalization ability. This is an instance of an empirically observed (and extensively studied) phenomenon regarding neural networks that defies classical statistical wisdom.
Finally, Theorem 9 still holds if each node has an associated positive output weight , that is if the network computes . This is easily seen by pushing inside .
2.3.2. Randomized Data Enjoys the Geometric Condition.
We now identify the smallest number of training samples such that random data almost surely satisfies (P) for .
Let and be i.i.d. random vectors with a jointly continuous distribution. Then,
If , then .
If , then for arbitrary .
Theorem 10 is likely folklore; we provide a proof in Section 4.14 for completeness. Observe that part (b) is straightforward: as , (P) cannot hold when .
2.3.3. Sample Complexity Bound for the Planted Network Model.
Combining Theorems 9 and 10, we arrive at the following sample complexity result.
Let be i.i.d. samples drawn from a jointly continuous distribution on and , where with .
Suppose , and . Then, with probability one over , the following holds. If with , then for all .
Suppose are i.i.d., each with centered i.i.d. entries having variance μ2 and a finite fourth moment μ4. Suppose . Then, there exists a such that , but .
The proof of Theorem 11 is provided in Section 4.15.
The lower bound in part (b) is very similar to the energy barrier for rank-deficient matrices per Theorems 1(a) and 2. Moreover, the interpolating network in (a) can be overparameterized. Theorem 11 provides the necessary and sufficient number of data points for training a shallow quadratic network so as to guarantee perfect generalization property.
2.3.4. Related Work.
We now discuss a related line of work on polynomial activations that seek necessary and sufficient conditions and/or the smallest sample size under which the loss functions satisfy various favorable properties. Venturi et al. [90] study a key topological property of the loss function, namely, the presence/absence of spurious valleys. Similar to us, they identify necessary and sufficient conditions independent of the data distribution. One of their results (Venturi et al. [90, corollary 9]) shows that the empirical risk admits no spurious valleys provided the network is sufficiently overparameterized: , where N is the sample size. (A close variant of this result is in fact shown earlier by Livni et al. [59]; see Venturi et al. [90] for a comparison.) Moreover, they also establish a similar guarantee regarding the population risk for quadratic networks; see Venturi et al. [90, theorem 12]. Similar analysis for quadratic networks is conducted also by Soltanolkotabi et al. [83] and Du et al. [24] mentioned earlier. In particular, Soltanolkotabi et al. [83] establish the absence of spurious local minima for when ; Du et al. [24] establish that a regularized version of the empirical risk has no spurious minima when (a) m > d or (b) and m < d. Yet another related work by Mannelli et al. [60] show the following when the data has i.i.d. standard normal entries. For , the only minimizer of the empirical risk is a planted weight matrix, whereas for , the empirical risk admits spurious local minima, both w.h.p. as .
3. Preliminaries
We collect herein several useful auxiliary results that we employ in our proofs.
3.1. An Analytical Expression for the Population Risk
Toward proving our energy barrier results, Theorems 2 and 1, we start with providing an analytical expression for the population risk of any in terms of how close it is to the planted weight matrix .
We recall that a random vector X in is defined to have a jointly continuous distribution if there exists a measurable function such that, for any and Borel set ,
Let be the function computed by (1) and be similarly the function computed by (1) for . Recall,
Suppose the distribution of X is jointly continuous. Then, , that is, almost surely with respect to X if and only if for some orthonormal matrix . Suppose now that the coordinates of are i.i.d. with , and .
It holds that
where and is the Hadamard product of A with itself. In particular, if has i.i.d. standard normal coordinates, we obtain .The following bounds hold:
and
The proof of Theorem 12 is provideed in Section 4.1.
In a nutshell, Theorem 12 states that the population risk of any is completely determined by how close it is to the planted weights as measured by the matrix and the second and fourth moments of the data. This is not surprising: is essentially a function of the first four moments of the data and the difference of the quadratic forms generated by W and , which is precisely encapsulated by the matrix A. Note also that the characterization of the optimal orbit per part (a) is not surprising either: any matrix W with the property , where is an orthonormal matrix, that is, , has the property that for any data . Part (a) then says the reverse is true as well provided that the distribution of X is jointly continuous. Note also that, for X with centered i.i.d. entries, the thesis of part (a) follows also from part (c): implies that , which, together with the fact that A is symmetric, then yields A = 0; that is, .
3.2. Useful Lemmas and Results from Linear Algebra and Random Matrix Theory
Our next result is a simple norm bound for the ensemble with sub-Gaussian coordinates.
Let be an i.i.d. collection of random vectors with centered i.i.d. sub-Gaussian coordinates, that is, for some constant C > 0, for every , and . Then,
The proof of Lemma 1 is provided in Section 4.3.
Our energy barrier result Theorem 2 for the empirical risk is proven by establishing the emergence of a barrier for a single rank-deficient together with a covering numbers argument.
Let be an arbitrary constant and be a collection of i.i.d. data with centered i.i.d. sub-Gaussian coordinates for which for any M > 0, the mean of conditional on is zero, and let be the corresponding label generated by a neural network with planted weights as per (1), where . Fix any , where , and . Define the event
In particular,
The parameter K1 appearing in Lemma 2 controls the amount of truncation applied on training data, and K2 controls the norm of the planted weight matrix. The proof of Lemma 2 is provided in Section 4.4.
The next result is a covering number bound adopted from Candès and Plan [14, lemma 3.1] with minor modifications.
Let
Then, there exists a net for SR in the Frobenius norm (that is, for every , there exists an such that ) such that
The proof of Lemma 3 is provided in Section 4.5.
Some of our results use the following well-known results. These results are verbatim from the literature and provided herein without proof.
(
(
Our results regarding the initialization guarantees use the several auxiliary results from random matrix theory: the spectrum of tall random matrices are essentially concentrated.
(
(
The following concentration result, recorded herein verbatim from Vershynin [91], is useful for our approximate stationarity analysis.
(
Here, c > 0 is an absolute constant.
Finally, we make use of the matrix-operator version of Hölder’s inequality.
(
4. Proofs
In this section, we present the proofs of the main results of this paper.
The order of the proofs presented herein is slightly different from the order of the corresponding results in the main body in that none of the following proofs (with one exception that we detail) use a proof presented later than itself. That is, whenever we present the proof of a result, it is ensured that, if this proof requires another result as a building block, this building block is shown earlier. The rationale behind this is to avoid any potential confusion and to ensure that no cyclic reasoning is present.
With this arrangement, only Theorem 4 uses results presented later in this section (more precisely, it uses Theorems 9 and 10), and it can be checked directly that there is no cyclic reasoning in the proof of Theorem 4.
4.1. Proof of Theorem 12
First, we have
Recall Theorem 13. In particular, if , then we have almost surely. Because a polynomial, it then follows that P(X) = 0 identically. Now, because A is symmetric, it has real eigenvalues, called with corresponding (real) eigenvectors . Now, taking , we have . Because , we get for any i. Finally, because , it must necessarily be the case that A = 0. Hence, , which implies for some orthonormal, per Theorem 14.
Using Equation (5), we first have
Note that, if , then because X has centered i.i.d. coordinates. Keeping this in mind and carrying out the algebra, we then get
Using now Equations (6) and (7), we get
because .Define k to be such that , namely, k is related to measures of dispersion pertaining to Xi: is the coefficient of variation, and is the kurtosis associated to the random variable Xi. With this, we have
From here, the desired conclusion follows because
4.2. Proof of Theorem 1
Note first that using Theorem 12 part (c), we have
Now, fix any with . Let be the eigenvalues of , be the eigenvalues of , and be the eigenvalues of . Because W is rank deficient, we have . Furthermore, because the eigenvalues of are precisely the squares of the singular values of . Now, recall the (Courant–Fischer) variational characterization of the eigenvalues (Horn and Johnson [47]). If M is a d × d matrix with eigenvalues , then
With this, fix an with . Then,
Because this inequality holds for every x with , we can take the max over all x and arrive at
Now, because are precisely the eigenvalues of A2, we have . Hence, for any W with , it holds that
Finally, because , the desired conclusion follows by taking the minimum over all rank-deficient W.
Let the eigenvalues of be denoted by with the corresponding orthogonal eigenvectors . Namely, diagonalize as , where the columns of are and is a diagonal matrix with for every . Let
Observe that , where is a diagonal matrix with for every and and that . Now, let be the rows of and fix a such that .
Having constructed a , we now prescribe as follows. For , let , where Wi is the ith row of W. Then, set , and for every , set . For this matrix, we now claim
To see this, fix an and recall that . We now compute this quantity more explicitly:
Hence, for every , we have
Now, let . Note that is symmetric and for every . Now, taking X to be ei, that is, the ith element of the standard basis for the Euclidean space , we deduce for every . For the off-diagonal entries, let . Then, , which, together with the fact that the diagonal entries of Ξ are zero, imply , namely, Ξ is skew-symmetric. Finally, because Ξ is also symmetric, we have , which then implies for every , that is, Ξ = 0, and thus, .
Hence, we have for with ,
Using now the upper bound provided by Theorem 12 part (c) yields the desired claim. Therefore, the energy band lower bound is tight up to a multiplicative constant.
4.3. Proof of Lemma 1
For any fixed , note that using sub-Gaussian property one has ; thus, , using union bound, which yields the conclusion.
4.4. Proof of Lemma 2
Let
By Lemma 1, . Now, note that
We now study ; hence, assume we condition on from now on. The triangle inequality yields
Observe now that
Now, the Cauchy–Schwarz inequality with respect to the inner product yields
Next, let , and let be the eigenvalues of , all nonnegative. Observe that
Now, note that are the eigenvalues of . With this reasoning, we have
Consequently, , and therefore, the exact same reasoning yields
We now apply concentration to i.i.d. sum
Now, recalling the distributional assumption on the data, we have that, conditional on , the data still has i.i.d. centered coordinates. In particular, the energy barrier result for the population risk as per Theorem 1 applies:
Finally applying Hoeffding’s inequality for bounded random variables, we arrive at
Returning to (8), this yields
This completes the proof of Lemma 2.
4.5. Proof of Lemma 3
The proof is almost verbatim from Candès and Plan [14, lemma 3.1], and included herein for completeness.
Note that any , and decomposes as , where , satisfying and , a diagonal matrix with nonnegative diagonal entries. Notice, furthermore, that as Q is orthonormal. With this, we now construct an appropriate net covering the set of all permissible Q and Σ.
Let D be the set of all r × r diagonal matrices with nonnegative diagonal entries with a Frobenius norm at most R. Let be an net for D in the Frobenius norm. Using standard results (see, e.g., Vershynin [91, lemma 5.2]), we have
Now, let . To cover , we use a more convenient norm defined as
Clearly,
We now claim is indeed a net for SR in the Frobenius norm. To prove this, take an arbitrary , and let . There exists a and a such that , and . Now, let . Observe that, using the triangle inequality,
For the first term, note that, because Q is orthonormal, . Next,
As a side remark, observe that we gain an extra factor of two in the exponent owing to the fact that A is positive semidefinite (otherwise, the bound would be ).
4.6. Proof of Theorem 3
We first establish the following proposition for any W, which is a stationary point of the population risk.
Let be a diagonal matrix with and define analogously. Then, enjoys the stationarity equation
To that end, fix a and . Note that , using the dominated convergence theorem. Next, implies that, for every and ,
Note next that computes as
We now put this object into a more convenient form. Notice that the preceding expression is
Observe that . We now study and more carefully. Observe that . Now, let be a diagonal matrix in which if i = j and zero otherwise. We then have .
We now study . Recall that is the ith row . Observe that . Hence,
In particular, stationarity yields
Having now established Proposition 1 for the stationarity equation, we now study its implications for any full-rank W.
Let be a stationary point with . We first establish . Because is a stationary point, it holds that, for every . In particular, Equation (9) holds.
Recalling now that W is full rank, it follows from the rank-nullity theorem that is trivial, that is, . Hence, for matrices M1, M2 (with matching dimensions), whenever WM1 = WM2 holds, we deduce M1 = M2 because each column of is contained in . Thus, Equation (9) then yields
Next, note that , and similarly, . In particular, taking traces of both sides in Equation (10), we get
Now, suppose . Note that inspecting the preceding (i, i) coordinate, we get
Because , we then get
Now, focus on off-diagonal entries by fixing . Observe that, because , it also holds that . Now, note that, in this case. We then have
We conclude that the matrix is a zero matrix. Hence, for some orthonormal per Theorem 14, and .
4.7. Proof of Theorem 5
Let be a sequence of m × d matrices corresponding to the weights along the trajectory of gradient descent, that is, is the weight matrix at iteration t of the algorithm. We first show . To see this, recall Theorem 12(c): , where . In particular, this yields . Hence, for any W with , it holds that
Namely, the (Frobenius) norm of the weights of any W with remains uniformly bounded from above. This, in turn, yields that the (spectral norm of the) Hessian of the objective function remains uniformly bound from above for any such W because the objective is a polynomial function of W, which is precisely what we denote by L.
We now run gradient descent with a step size of : a second order Taylor expansion reveals that
In particular, , and furthermore, , where is the spectral norm of the Hessian matrix . From here, we induct on k: the induction argument reveals that we can retain a step size of , and furthermore, we deduce that the gradient descent trajectory is such that (i) for every , and furthermore, (ii) it holds for every that
We now establish that as . Note that the objective function is lower bounded (by zero). If the gradient is nonvanishing, then (by passing to a subsequence if necessary) each step reduces the value of the objective function at least by a certain amount, that is (uniformly) bounded away from zero. But this contradicts the fact that the objective is lower bounded. Thus, we deduce
Now, recall that the trajectory is such that and as . Suppose that the initial value, , is such that
In particular, for every ,
Therefore, is full rank for all k, per Theorem 1. We now establish
To see this, observe that the sequence is monotonic (nonincreasing) and, furthermore, is bounded by zero from below. Hence,
Because the weights remain bounded along the trajectory, it follows that there exists a subsequence with a limit, that is, as , where . Now, the continuity of , together with the continuity of the norm , imply that . Furthermore, continuity of then implies . Now, because s are such that for all and is strictly smaller than the rank-deficient energy barrier, by taking limits as and using (11), we conclude that is full rank. Because is also a stationary point of the loss, by Theorem 3, we deduce , which yields as desired.
4.8. Proof of Theorem 7
4.8.1. Part (a).
Let . Then, using Theorem 15, it holds that, with probability ,
Recall that denotes the spectrum of A, that is, . We claim then the spectrum of is . To see this, simply note the following line of reasoning:
Now, let be such that with . In particular, if are the eigenvalues of with ; then, it holds that
Now, recall by Theorem 12(c) that
For the first term, note first that, if are the eigenvalues of , then
Finally, using the triangle inequality,
Using now the overparameterization , we further have
Note that, for the constant ,
Next, observe that, for m large (in the regime with C large enough). Thus, using what we establish in Theorem 1, we arrive at
Finally, observe also that, if , then as well: indeed observe that, if , then Xi = 0 almost surely, for which . In particular, . Equipped with this, we then observe that provided
4.8.2. Part (b).
Note that the result of Bai and Yin [3] asserts that, if are the eigenvalues of
Note that . Hence, we obtain
Now, note that
We now use CLT as and . To that end, let be fixed. Observe now that, for any arbitrary M > 0, and sufficiently large d,
In particular, using the central limit theorem, we deduce
Hence,
Moreover,
4.9. Proof of Theorem 2
First, let
We start with the following claim.
In the setting of Theorem 2, the following holds. With probability at least (where is some absolute constant), it holds that, for any with ,
For convenience, let , and for the random data vector , let . Recall that X has i.i.d. centered coordinates with a sub-Gaussian coordinate distribution.
We have the following, in which the implication is due to Cauchy–Schwarz:
We now establish that, with probability at least , the following holds provided : for every ,
To see this, we begin by noticing . Using this, we have
We now use Hölder’s inequality (Theorem 18) with and . This yields
Observing now , we have
Hence,
This yields, for any W with ,
Furthermore, . This yields
We now take and conclude that
Having established Claim 1, we now return to the proof of Theorem 2. Let
A consequence of Claim 1 is that . We now establish the following.
Note that combining Claims 1 and 2 through a union bound yields
Let . We claim . To see this, note that . Let be the eigenvalues of A, all nonnegative as , and are the eigenvalues of A2. With this,
Hence, as requested.
Next, let
Now, applying Lemma 2 with and and taking a union bound across the net , we arrive at the following conclusion:
Now, because by Lemma 1, we obtain
In the remainder of the proof, suppose for every ,
Now, let with . Let (thus, ) and be such that . We now estimate
For notational convenience, let . Now,
Now, using Cauchy–Schwarz (for inner product ),
For the term , we observe that the triangle inequality yields
Thus,
Using these, we obtain
Because it is already noted that Claims 1 and 2 together yield Theorem 2, we complete the proof of Theorem 2.
4.9.1. Case of Constant d: .
We now carry out a separate analysis for the case of constant d (). We only point out the necessary modifications, hiding factors depending on the constant d under asymptotic notations.
In what follows, we use the fact that, if X is a sub-Gaussian random variable, then for every . For a proof, see Vershynin [91, lemma 5.5], which establishes a stronger conclusion that for every .
4.9.2. Modifying Claim 1.
Claim 1 modifies to the following: with probability at least , it holds that, for any with ,
We now sketch the proof of this modified claim. For a matrix , denote by and let be arbitrary. Then, we show that, over the randomness in with probability at least ,
Indeed, fix an . Then, for any ,
This is by Chebyshev’s inequality (here, ). Here, we used, in particular, the fact that . Likewise, for any ,
Taking a union bound over these events corresponding to the entries of matrix , we are done. (Note that the number of events, , is O(1). This, and other factors depending on ϵ are hidden under .) Using the trivial bound , valid for any matrix , we arrive at the operator norm bound
Finally, taking , we finish the proof of modified Claim 1.
4.9.3. Modifying Claim 2.
Claim 2 now modifies to , and this modified version is shown as follows. Note that, for any , the size of the net we consider is O(1). We claim that, with probability , it holds that, for any ,
To show this, fix an . Now, instead of Lemma 2, one can apply Chebyshev’s inequality:
Putting these together as in the proof of Theorem 2, we complete the proof.
4.10. Proof of Theorem 4
We start by computing . Taking derivatives with respect to the jth row Wj of , we arrive at
Interpreting these gradients as a row vector and aggregating into a matrix, we then have
Assume now that , and . We then arrive at
We now claim that . To see this, we take a route similar to Soltanolkotabi et al. [83, lemma 6.1]. Let , and consider the function
Observe that is quadratic in M. Thus, for any with , that is,
Namely, W is indeed a global optimizer of . Because makes the cost zero, we obtain .
Now, using Theorem 10, we obtain that is the set of all d × d symmetric matrices with probability one provided . In this case, using Theorem 9, we conclude that , concluding the proof.
4.11. Proof of Theorem 6
4.11.1. Part (a).
Note that, by Claim 1, it follows that, with probability at least , it is the case that, for any W with . Now, let
Note that the . Thus, on the event , which holds with probability at least , we have that
4.11.2. Part (b).
Suppose that the event (where and are defined, respectively, in (13) and (14)) takes place. We run the gradient descent with a step size of . A second order Taylor expansion reveals that
Now, let T be the first time for which , namely, the horizon required to arrive at an stationary point. In what follows, we carry out our analysis in terms of ζ. At the end, we incorporate the bound (4) on ζ.
We claim .
To see this, note that, from the definition of T, it holds that as . Now, a telescoping argument together with reveals
Using now , we conclude . Because is at most polynomial in d as per (12), we conclude .
We now turn our attention to bounding its risk. Let . Note that . Now,
Using the Cauchy–Schwarz inequality, we have
Next, , using the fact that . In particular, on the event defined as per (13), we conclude that , and therefore, . This, together with and the triangle inequality, then yields
With this, we now turn our attention to bounding
We establish that, for the event
This is almost a straightforward modification of the proof of the earlier energy barrier result Theorem 2, and we only point out required modifications. Take any with . In particular,
Inspecting now the proof of Theorem 1(a), we obtain that, for such a W,
Using now a covering numbers bound, in the same manner as in the proof of Theorem 2, we conclude that
Now, suppose in the remainder of this part that the event , which is
Inspecting the proof of Theorem 4, we observe that
Thus, we arrive at
Let
Note now that
Next, we have
We now combine these findings.
We now use the bounds on as per (13), on as per (14), and on as per (17) to control . We conclude by the union bound that, with probability at least
Finally, because
4.11.3. Part (c).
Let be such that . Define the matrix
We bound , which ensures weights WTW are uniformly close to ground truth weights defined as . We start by conditioning: assume in the remainder that the event in (14) stating for every is true; this holds with probability at least as per Lemma 1.
Note that
To this end, consider a matrix , consisting of i.i.d. rows in which the row of Ξ is . Next, let
Next, we have
We start with the second term. Recall that , and we condition on . Next, using the Cauchy–Schwarz inequality,
Hence,
We now control . This is done in a manner similar to the proof of Emschwiller et al. [33, theorem 3.2]. The main tool is the result Theorem 17 for concentration of the spectrum of random matrices with i.i.d. nonisotropic rows. The parameter setting we operate under is provided as follows.
|
| Parameter | Value |
|---|---|
| m | d 2 |
| t | |
| δ | |
| γ |
Start by verifying that, because we condition on , it is indeed the case that the norm of each row of Ξ is at most d; thus, the preceding value of m works.
We now claim . To prove this, it suffices to show
Using Emschwiller et al. [33, theorem 5.1] (also see Remark 2) with k = 2, we obtain for some absolute constant c > 0 depending only on the data coordinate distribution. Consequently,
We now claim
This is equivalent to establishing
Using again Emschwiller et al. [33, theorem 5.1], we have for some absolute constant f > 0. This yields
The rest is verbatim from Emschwiller et al. [33]. We now apply Theorem 17. With probability at least (here, c > 0 is an absolute constant), it holds that
Now, for ,
Now, using the Courant–Fischer variational characterization of the smallest singular value (Horn and Johnson [47]), we obtain
We now return to (19) to specifically bound . Let A be any matrix A. Note that . Indeed, taking the singular value decomposition and observing , we obtain . This, together with (23), yields
We now have all ingredients to execute the bound in (19). Combining Equations (21) and (24), we get
Because with probability at least , we have that
We now show the generalization ability. For any , using auxiliary result Theorem 12(c), we have
Finally, because
The argument presented uses Emschwiller et al. [33, theorem 5.1]. Even though that result is stated for distributions supported on , it still applies under the weaker assumption that the distribution has finite moments of all orders; see Emschwiller et al. [33, remark 5.5].
4.11.4. Case of Constant d: .
We provide a very brief sketch for the argument in the case . The argument is quite similar to the one in Theorem 2. Similar to the analysis (of case) conducted for Theorem 2, we use the fact that, if X has a sub-Gaussian random variable, then for every , and in particular, for all ; see Vershynin [91, lemma 5.5] for a more precise statement.
Next, the upper bound on the energy value now modifies to .
4.11.5. Part (a).
Note that part (a) for the case of general d follows from earlier Claim 1. For the case when d is constant, part (a) now follows from modified Claim 1, provided under Theorem 2 for the case . That is, for the event
Furthermore,
Combining these, we find that
4.11.6. Part (b).
The analysis for time horizon T remains intact. Furthermore, the entire analysis leading to (15) remains (nearly) intact, and this equation now modifies to
4.11.7. Part (c).
The modification for this part is as follows. First, we do not condition on as earlier. Instead, we apply Chebyshev’s inequality (elaborated subsequently). The entire analysis leading up to (20) remains the same. Note now that . For each coordinate of this vector, we have, using Chebyshev’s inequality,
Next, to control , we do not need a delicate concentration result (such as Theorem 17) as earlier. Instead, we take the following route.
Fix to be tuned. Recall the notation from earlier, where is the row of matrix and recall that are i.i.d. random vectors. Using the outer product representation of matrix multiplication as before, we have
Consequently,
Because , a simple application of Chebyshev’s inequality together with a union bound over entries (of ) yields
The rest of the analysis remains intact except that the probability bounds are now modified to . Finally, recalling the bound (4) on ζ, we find that, with probability ,
4.12. Proof of Theorem 8
Let , and let . In what follows, recall the quantities from the proof of Theorem 7(b): , where is the semicircle law. Fix now an arbitrary and a K > 0.
We start by defining several auxiliary events:
Note that from the proof of Theorem 7(b) that we have for i = 1, 2, 3, and from the union bound and sub-Gaussianity of X, . Thus,
In what follows, suppose we condition on the event . Note that, in this conditional universe, it is still the case that Xi, are i.i.d. random vectors with centered i.i.d. coordinates. Using now Hölder’s inequality (Theorem 18) with , and , we arrive at
Namely, is the population risk in the conditional universe.
Next, in this conditional space, using Theorem 12(c), we arrive at
Finally, carrying out the same analysis as in the end of the proof of Theorem 7, we deduce
4.13. Proof of Theorem 9
Let , the set of all d × d symmetric matrices, and let be such that for any i, . We establish M = 0. Let be two fixed indices. To that end, let be such that , where the column vectors are, respectively, the kth and elements of the standard basis for . Such indeed exist because of the spanning property. Observe that . Now, using the fact that for all matrices A, B, C (with matching dimensions), we have
for every . Finally, if W is such that , then for any i, where . Hence, provided that the geometric condition holds, we have M = 0, that is, . From here, the final conclusion follows per Theorem 14. Because , W clearly has zero generalization error, that is, .Our goal is to construct a with for every , whereas . Consider the inner product in the space of all symmetric d × d matrices. Find , a symmetric matrix, such that , that is, for every . We can find such M satisfying . Consider the linear matrix function . Note that is symmetric for every δ. We claim that, under the hypothesis of the theorem, there exists a such that is positive semidefinite for every and that there exists with for all . Observe that, because , then with . Therefore, the eigenvalues of are all positive. In particular with . Now, let be the eigenvalues of . Using Weyl’s inequality (Horn and Johnson [47]), we have for every i. In particular, taking , we deduce, for every , it holds that , that is, . In particular, we also have that is symmetric, and thus, it is positive semidefinite. Thus, there exists a such that . Now, using the same idea as in the proof of Theorem 1(c), we then deduce that, for any , there exists a matrix such that . In particular, for this , if is the function computed by the neural network with weight matrix , then on the training data, because for all . At the same time, because and , and therefore, .
Finally, to show , we argue as follows. Suppose . Then, by Theorem 13, it follows that identically, where . Now, letting be the eigenvectors of A (with corresponding eigenvalues ), we obtain . We, namely, obtain for every . Finally, because A is symmetric and, hence, admits a diagonalization of form with diagonal entries of Λ being zero, we deduce A is identically zero, which contradicts with the fact that , which is a nonzero matrix.
4.14. Proof of Theorem 10
Recall that . Note that this space has dimension . For any , it is easy to see that the matrices are linearly independent, and there are precisely such matrices. With this in mind, the statement of part (b) is immediate.
We now prove part (a) of the theorem. For any Xi, let be the jth coordinate of Xi with , and let be a dimensional vector obtained by retaining and the products with . Now, let be an matrix, whose rows are . Our goal is to establish
We now prove part (b) by providing a deterministic construction (of the matrix under which . Let be distinct prime numbers. For every , set
In particular, , which then implies that is a vector of all ones. Now, we study . The entries of , called , are of form with or , where . By the fundamental theorem of arithmetic, we have , and therefore, are pairwise distinct. With this construction, the matrix is a Vandermonde matrix with determinant
Because for every (from the construction on , which, in turn, is constructed from X2), this determinant is nonzero, proving the claim.
4.14. Proof of Theorem 11
Note that, if , then, combining parts (a) of Theorems 9 and 10, we have that, with probability one, , which, together with , imply that
where from which the desired conclusion follows.Assume W is taken as in proof of Theorem 9(b), that is,
with . Let be the spectrum of the matrix δM. Using now Theorem 12(c), we have the lower boundbecause . Finally, because (as the spectral norm of M is one), we arrive at the desired conclusion.
The authors thank the anonymous reviewers for their very detailed feedback, which improved the presentation of this paper, and Orestis Plevrakis for providing useful remarks on the initial version of this paper. Part of this work was done when D. Gamarnik and E. C. Kızıldağ were visiting the Simons Institute for the Theory of Computing at the University of California, Berkeley in Fall 2020.
References
- [1] (2018) Stronger generalization bounds for deep nets via a compression approach. Preprint, submitted February 14, https://arxiv.org/abs/1802.05296.Google Scholar
- [2] (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. Preprint, submitted January 24, https://arxiv.org/abs/1901.08584.Google Scholar
- [3] (1988) Convergence to the semicircle law. Ann. Probab. 16(2):863–875.Crossref, Google Scholar
- [4] (1993) Limit of the smallest eigenvalue of a large dimensional sample covariance matrix. Ann. Probab. 21(3):1275–1294.Crossref, Google Scholar
- [5] (1994) Approximation and estimation bounds for artificial neural networks. Machine Learn. 14(1):115–133.Crossref, Google Scholar
- [6] (2017) Spectrally-normalized margin bounds for neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6240–6249.Google Scholar
- [7] (2019) Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Machine Learn. Res. 20(63):1–17.Google Scholar
- [8] (2013) Matrix Analysis, vol. 169 (Springer Science & Business Media, Berlin, Heidelberg).Google Scholar
- [9] (1989) Training a 3-node neural network is NP-complete. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 494–501.Google Scholar
- [10] (2019) Optimal approximation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1(1):8–45.Google Scholar
- [11] (2017) Globally optimal gradient descent for a convnet with gaussian inputs. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 605–614.Google Scholar
- [12] (2017) SGD learns over-parameterized networks that provably generalize on linearly separable data. Preprint, submitted October 27, https://arxiv.org/abs/1710.10174.Google Scholar
- [13] (2014) Solving quadratic equations via phaselift when there are about as many equations as unknowns. Foundations Comput. Math. 14:1017–1026.Crossref, Google Scholar
- [14] (2011) Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory 57(4):2342–2359.Crossref, Google Scholar
- [15] (2013) Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math. 66(8):1241–1274.Crossref, Google Scholar
- [16] (2005) The zero set of a polynomial. WSMR Report 05-02, University of Windsor, Windsor, Canada.Google Scholar
- [17] (2018) Neural ordinary differential equations. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 6571–6583.Google Scholar
- [18] (2018) On the global convergence of gradient descent for over-parameterized models using optimal transport. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 3036–3046.Google Scholar
- [19] (2015) The loss surfaces of multilayer networks. Artificial Intelligence Statist. (PMLR, New York), 192–204.Google Scholar
- [20] (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning. Proc. 25th Internat. Conf. Machine Learn. (ACM, New York), 160–167.Google Scholar
- [21] (2018) Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature Medicine 24(9):1342–1350.Crossref, Google Scholar
- [22] (2014) Stable optimizationless recovery from phaseless linear measurements. J. Fourier Anal. Appl. 20:199–221.Crossref, Google Scholar
- [23] (2023) An improved sample complexity for rank-1 matrix sensing. Preprint, submitted March 13, https://arxiv.org/abs/2303.06895.Google Scholar
- [24] (2018) On the power of over-parametrization in neural networks with quadratic activation, Preprint, submitted March 3, https://arxiv.org/abs/1803.01206.Google Scholar
- [25] (2017) When is a convolutional filter easy to learn? Preprint, submitted September 18, https://arxiv.org/abs/1709.06129.Google Scholar
- [26] (2018) Gradient descent provably optimizes over-parameterized neural networks. Preprint, submitted October 4, https://arxiv.org/abs/1810.02054.Google Scholar
- [27] (2018) Gradient descent finds global minima of deep neural networks. Preprint, submitted November 9, https://arxiv.org/abs/1811.03804.Google Scholar
- [28] (2017) Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. Preprint, submitted December 3, https://arxiv.org/abs/1712.00779.Google Scholar
- [29] (2017) Gradient descent can take exponential time to escape saddle points. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1067–1077.Google Scholar
- [30] (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint, submitted March 31, https://arxiv.org/abs/1703.11008.Google Scholar
- [31] (2016) The power of depth for feedforward neural networks. Conf. Learn. Theory (PMLR, New York), 907–940.Google Scholar
- [32] (2014) Phase retrieval: Stability and recovery guarantees. Appl. Comput. Harmonic Anal. 36(3):473–494.Google Scholar
- [33] (2020) Neural networks and polynomial regression. Demystifying the overparametrization phenomena. Preprint, submitted March 23, https://arxiv.org/abs/2003.10523.Google Scholar
- [34] (2016) Topology and geometry of half-rectified network optimization. Preprint, submitted November 4, https://arxiv.org/abs/1611.01540.Google Scholar
- [35] (2000) Eigenvalues, invariant factors, highest weights, and Schubert calculus. Bull. Amer. Math. Soc. 37(3):209–249.Crossref, Google Scholar
- [36] (2017) Learning one-hidden-layer neural networks with landscape design. Preprint, submitted November 1, https://arxiv.org/abs/1711.00501.Google Scholar
- [37] (2015) Escaping from saddle points—Online stochastic gradient for tensor decomposition. Conf. Learn. Theory (PMLR, New York), 797–842.Google Scholar
- [38] (2016) Reliably learning the ReLU in polynomial time. Preprint, submitted November 30, https://arxiv.org/abs/1611.10258.Google Scholar
- [39] (2017) Size-independent sample complexity of neural networks. Preprint, submitted December 18, https://arxiv.org/abs/1712.06541.Google Scholar
- [40] (2020) Approximation bounds for random neural networks and reservoir systems. Preprint, submitted February 14, https://arxiv.org/abs/2002.05933.Google Scholar
- [41] (2017) Implicit regularization in matrix factorization. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 30.Google Scholar
- [42] (2015) Global optimality in tensor factorization, deep learning, and beyond. Preprint, submitted June 24, https://arxiv.org/abs/1506.07540.Google Scholar
- [43] (2014) Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. Internat. Conf. Machine Learn. (PMLR, New York), 2007–2015.Google Scholar
- [44] (2016) Identity matters in deep learning. Preprint, submitted November 14, https://arxiv.org/abs/1611.04231.Google Scholar
- [45] (2017) Nearly-tight VC-dimension bounds for piecewise linear neural networks. Conf. Learn. Theory (PMLR, New York), 1064–1068.Google Scholar
- [46] (2016) Deep residual learning for image recognition. Proc. IEEE Conf. Comput. Vision Pattern Recognition (IEEE, Piscataway, NJ), 770–778.Google Scholar
- [47] (2012) Matrix Analysis (Cambridge University Press, Cambridge, MA).Crossref, Google Scholar
- [48] (2006) Extreme learning machine: Theory and applications. Neurocomputing 70(1–3):489–501.Crossref, Google Scholar
- [49] (2015) Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. Preprint, submitted June 28, https://arxiv.org/abs/1506.08473.Google Scholar
- [50] (2017) How to escape saddle points efficiently. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 1724–1732.Google Scholar
- [51] (2016) Deep learning without poor local minima. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 586–594.Google Scholar
- [52] (2021) Initialization and regularization of factorized neural layers. Preprint, submitted May 3, https://arxiv.org/abs/2105.01029.Google Scholar
- [53] (2012) Imagenet classification with deep convolutional neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1097–1105.Google Scholar
- [54] (2016) Gradient descent only converges to minimizers. Conf. Learn. Theory (PMLR, New York), 1246–1257.Google Scholar
- [55] (2016) The power of normalization: Faster evasion of saddle points. Preprint, submitted November 15, https://arxiv.org/abs/1611.04831.Google Scholar
- [56] (2017) Convergence analysis of two-layer neural networks with ReLU activation. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 597–607.Google Scholar
- [57] (2018) Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. Conf. Learn. Theory (PMLR, New York), 2–47.Google Scholar
- [58] (2017) Fisher-Rao metric, geometry, and complexity of neural networks. Preprint, submitted November 5, https://arxiv.org/abs/1711.01530.Google Scholar
- [59] (2014) On the computational efficiency of training neural networks. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 855–863.Google Scholar
- [60] (2020) Optimization and generalization of shallow neural networks with quadratic activation functions. Preprint, submitted June 27, https://arxiv.org/abs/2006.15459.Google Scholar
- [61] (2017) Extending the scope of the small-ball method. Preprint, submitted September 4, https://arxiv.org/abs/1709.00843.Google Scholar
- [62] (2016) Learning functions: When is deep better than shallow. Preprint, submitted March 3, https://arxiv.org/abs/1603.00988.Google Scholar
- [63] (2011) Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Language Processing 20(1):14–22.Crossref, Google Scholar
- [64] (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. Preprint, submitted July 29, https://arxiv.org/abs/1707.09564.Google Scholar
- [65] (2015) Norm-based capacity control in neural networks. Conf. Learn. Theory (PMLR, New York), 1376–1401.Google Scholar
- [66] (2017) Exploring generalization in deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 5947–5956.Google Scholar
- [67] (2017) The loss surface of deep and wide neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 2603–2612.Google Scholar
- [68] (2018) The loss surface and expressivity of deep convolutional neural networks (OpenReview.net).Google Scholar
- [69] (2017) Nonlinear random matrix theory for deep learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2637–2646.Google Scholar
- [70] (2017) Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review. Internat. J. Automation Comput. 14(5):503–519.Crossref, Google Scholar
- [71] (1991) Local minima and back propagation. IJCNN-91-Seattle Internat. Joint Conf. Neural Networks, vol. 2 (IEEE, Piscataway, NJ), 173–176.Google Scholar
- [72] (2023) A general algorithm for solving rank-one matrix sensing. Preprint, submitted March 22, https://arxiv.org/abs/2303.12298.Google Scholar
- [73] (2009) Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 1313–1320.Google Scholar
- [74] (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3):471–501.Crossref, Google Scholar
- [75] (2018) Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error. Preprint, submitted May 2, https://arxiv.org/abs/1805.00915.Google Scholar
- [76] (1964) Principles of Mathematical Analysis, vol. 3 (McGraw-Hill, New York).Google Scholar
- [77] (2017) Spurious local minima are common in two-layer ReLU neural networks. Preprint, submitted December 24, https://arxiv.org/abs/1712.08968.Google Scholar
- [78] (2017) Nonparametric regression using deep neural networks with relu activation function. Preprint, submitted August 22, https://arxiv.org/abs/1708.06633.Google Scholar
- [79] (2014) Provable methods for training neural networks with sparse connectivity. Preprint, submitted December 8, https://arxiv.org/abs/1412.2693.Google Scholar
- [80] (2017) Mastering the game of go without human knowledge. Nature 550(7676):354–359.Crossref, Google Scholar
- [81] (2020) Mean field analysis of neural networks: A central limit theorem. Stochastic Processes Appl. 130(3):1820–1852.Crossref, Google Scholar
- [82] (2017) Learning ReLUs via gradient descent. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), 2007–2017.Google Scholar
- [83] (2018) Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Trans. Inform. Theory 65(2):742–769.Google Scholar
- [84] (2018) A mean field view of the landscape of two-layers neural networks. Proc. Natl. Acad. Sci. USA 115(33):E7665–E7671.Google Scholar
- [85] (2016) No bad local minima: Data independent training error guarantees for multilayer neural networks. Preprint, submitted May 26, https://arxiv.org/abs/1605.08361.Google Scholar
- [86] (2017) Exponentially vanishing sub-optimal local minima in multilayer neural networks. Preprint, submitted February 19, https://arxiv.org/abs/1702.05777.Google Scholar
- [87] (2021) Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Adv. Neural Inform. Processing Systems (MIT Press, Cambridge, MA), vol. 34, 23831–23843.Google Scholar
- [88] (2016) Benefits of depth in neural networks. Preprint, submitted February 14, https://arxiv.org/abs/1602.04485.Google Scholar
- [89] (2017) An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 3404–3413.Google Scholar
- [90] (2019) Spurious valleys in one-hidden-layer neural network optimization landscapes. J. Machine Learn. Res. 20(133):1–34.Google Scholar
- [91] (2010) Introduction to the non-asymptotic analysis of random matrices. Preprint, submitted November 12, https://arxiv.org/abs/1011.3027.Google Scholar
- [92] (2022) Nonlinear initialization methods for low-rank neural networks. Preprint, submitted February 2, https://arxiv.org/abs/2202.00834.Google Scholar
- [93] (2018) On the margin theory of feedforward neural networks. Preprint, submitted October 12, https://arxiv.org/abs/1810.05369.Google Scholar
- [94] (2017) Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Comm. Math. Statist. 5(4):349–380.Crossref, Google Scholar
- [95] (2017) Toward understanding generalization of deep learning: Perspective of loss landscapes. Preprint, submitted June 30, https://arxiv.org/abs/1706.10239.Google Scholar
- [96] (2016) Understanding deep learning requires rethinking generalization. Preprint, submitted November 10, https://arxiv.org/abs/1611.03530.Google Scholar
- [97] (2015) Efficient matrix sensing using rank-1 gaussian measurements. Algorithmic Learn. Theory: 26th Internat. Conf. Proc. (Springer, Berlin, Heidelberg), 3–18.Google Scholar
- [98] (2017) Learning non-overlapping convolutional neural networks with multiple kernels. Preprint, submitted November 8, https://arxiv.org/abs/1711.03440.Google Scholar
- [99] (2017) Recovery guarantees for one-hidden-layer neural networks. Proc. 34th Internat. Conf. Machine Learn., vol. 70 (JMLR.org), 4140–4149.Google Scholar
- [100] (2017) The landscape of deep learning algorithms. Preprint, submitted May 19, https://arxiv.org/abs/1705.07038.Google Scholar
- [101] (2017) Critical points of neural networks: Analytical forms and landscape properties. Preprint, submitted October 30, https://arxiv.org/abs/1710.11205.Google Scholar

